-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Watcher add stopped listener #43939
Watcher add stopped listener #43939
Conversation
@elasticmachine update branch |
Pinging @elastic/es-core-features |
I may need to adjust the time out or tests based on a few runs through CI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a few nitpicks, but the racy issue should be solved IMO
...ck/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/WatcherLifeCycleService.java
Outdated
Show resolved
Hide resolved
...ck/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/WatcherLifeCycleService.java
Outdated
Show resolved
Hide resolved
...ck/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/WatcherLifeCycleService.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/WatcherService.java
Outdated
Show resolved
Hide resolved
...lugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/execution/CurrentExecutions.java
Outdated
Show resolved
Hide resolved
...ugin/watcher/src/test/java/org/elasticsearch/xpack/watcher/WatcherLifeCycleServiceTests.java
Outdated
Show resolved
Hide resolved
something that dawned me this morning: when changing this is the fact that there is no guarantee about order the execution when passed to the generic executor, so passing in |
I'm not sure I follow. https://github.com/elastic/elasticsearch/pull/43939/files#diff-5831c85834676ac07259e13086bf1a95R108 only allows transitions from STOPPING -> STOPPED , and the lines above only allow transitions from STARTED -> STOPPING . It is possible for the STOPPED to never be reached (if Watcher was restarted while waiting for the async to clean up to complete ). The tests should not allow this. |
@spinscale @martijnvg - Pending clarification of #43939 (comment) , This should be ready for another pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few question mainly for my own education.
...ck/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/WatcherLifeCycleService.java
Outdated
Show resolved
Hide resolved
// if this is not a data node, we need to start it ourselves possibly | ||
if (event.state().nodes().getLocalNode().isDataNode() == false && | ||
isWatcherStoppedManually == false && this.state.get() == WatcherState.STOPPED) { | ||
isWatcherStoppedManually == false && isStoppedOrStopping) { | ||
this.state.set(WatcherState.STARTING); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we only be able to set the state to STARTING
if the current state is STOPPED
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to stay passive with the old behavior. If we kept this to only STOPPED
with this change, it would mean you can not restart watcher while any watches are currently inflight.
The pre-existing design allows you "stop" and restart it immediately and re-process the same watch, any inflight watches will finish up. If I didn't allow STOPPING
here you would have to wait for all inflight watches to finish up and if one were were stuck (for what ever reason) could cause the in-ability to ever reach a fully STOPPED
state.
// if this is not a data node, we need to start it ourselves possibly | ||
if (event.state().nodes().getLocalNode().isDataNode() == false && | ||
isWatcherStoppedManually == false && this.state.get() == WatcherState.STOPPED) { | ||
isWatcherStoppedManually == false && isStoppedOrStopping) { | ||
this.state.set(WatcherState.STARTING); | ||
watcherService.start(event.state(), () -> this.state.set(WatcherState.STARTED)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we guard against going into STARTED state from a state other than STARTING?
(this.state.compareAndSet(WatcherState.STARTING, WatcherState.STARTED);
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to avoid changes to the START*
states/behavior. This change should only impact the STOPPED
state for a manual (e.g. via the API) request to stop.
@elasticmachine update branch |
As of elastic#43939 Watcher tests now correctly block until all Watch executions kicked off by that test are finished. Prior we allowed tests to finish with outstanding watch executions. It was known that this would increase the time needed to finish a test. However, running the tests on CI can be slow and on at least 1 occasion it took 60s to actually finish. This PR simply increases the max allowable timeout for Watcher tests to clean up after themselves.
As of #43939 Watcher tests now correctly block until all Watch executions kicked off by that test are finished. Prior we allowed tests to finish with outstanding watch executions. It was known that this would increase the time needed to finish a test. However, running the tests on CI can be slow and on at least 1 occasion it took 60s to actually finish. This PR simply increases the max allowable timeout for Watcher tests to clean up after themselves.
This test is believed to be fixed by elastic#43939
These tests are believed to be fixed by elastic#43939 closes elastic#45582 and elastic#43975 These tests are currently only muted master. Jul 23 - muted in master (Test transform scripts are updated on execution) Aug 14 - muted in master (Test condition scripts are updated on execution) Aug 15 - last recorded failure (from either test on any branch) Aug 16 - elastic#43939 commited to master Aug 22 - elastic#43939 backported to 7.x
This test is believed to be fixed by elastic#43939 closes elastic#43889 -------- This test is currently only muted master. Jul 10 - muted in master Aug 16 - elastic#43939 commited to master Aug 22 - elastic#43939 backported to 7.x Aug 23 - last recorded failure (on any branch)
This test is believed to be fixed by elastic#43939 closes elastic#43988
This test is believed to be fixed by elastic#43939 closes elastic#43988
This test is believed to be fixed by elastic#43939 closes elastic#40178 -------- Note - this test was run for 24+ hours on CI hardware with no failures.
This test is believed to be fixed by elastic#43939 closes elastic#40178
This test is believed to be fixed by elastic#43939 closes elastic#33185
When Watcher is stopped and there are still outstanding watches running
Watcher will report it self as stopped. In normal cases, this is not problematic.
However, for integration tests Watcher is started and stopped between
each test to help ensure a clean slate for each test. The tests are blocking
only on the stopped state and make an implicit assumption that all watches are
finished if the Watcher is stopped. This is an incorrect assumption since
Stopped really means, "I will not accept any more watches". This can lead to
un-predictable behavior in the tests such as message : "Watch is already queued
in thread pool" and state: "not_executed_already_queued".
This can also change the .watcher-history if watches linger between tests.
This commit changes the semantics of a manual stopping watcher to now mean:
"I will not accept any more watches AND all running watches are complete".
There is now an intermediary step "Stopping" and callback to allow transition
to a "Stopped" state when all Watches have completed.
Additionally since this impacts how long the tests will block waiting for a
"Stopped" state, the timeout has been increased.
Related: #42409